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Abstract —In this paper, we present a novel way to summarize 
the structure of large graphs, based on non-parametric estimation 
of edge density in directed multigraphs. Following coclustering 
approach, we use a clustering of the vertices, with a piecewise 
constant estimation of the density of the edges across the 
clusters, and address the problem of automatically and reliably 
inferring the number of clusters, which is the granularity of 
the coclustering. We use a model selection technique with data- 
dependent prior and obtain an exact evaluation criterion for 
the posterior probability of edge density estimation models. 
We demonstrate, both theoretically and empirically, that our 
data-dependent modeling technique is consistent, resilient to 
noise, valid non asymptotically and asymptotically behaves as 
an universal approximator of the true edge density in directed 
multigraphs. We evaluate our method using artificial graphs and 
present its practical interest on real world graphs. The method is 
both robust and scalable. It is able to extract insightful patterns 
in the unsupervised learning setting and to provide state of the 
art accuracy when used as a preparation step for supervised 
learning. 

I. Introduction 

With the recent availability of much network data, such as 
world wide web, social networks, phone call networks, science 
collaboration graphs 0, |2|. there is a renewed interest for 
the graph partitioning problem, especially for the automatic 
discovery of community structures in large networks o, 0 , 
0 Beyond clustering approaches, coclustering approaches 
aim at summarizing the relation between two entities in a 
many-to-many relationship. Such a relation can be represented 
as a graph, where the source and target vertices represent 
entities and the edges stand for relations between entities. A 
coclustering model provides a summary of a graph by grouping 
source vertices and target vertices. For example, in market 
analysis , the source vertices of the graph represent customers, 
the target vertices represent products and there is one edge 
each time a customer has purchased a product. A coclustering 
model summarizes the dataset by grouping customers that 
have purchased approximately the same products and grouping 
products that have been purchased by approximately the same 
customers. Coclustering models have been applied to many 
other domains, such as information retrieval (the entities are 
documents and their words in a text corpus), web log analysis 
(cookies and their visited web pages), web structure analysis 
(web pages with hyper-links between them) or telecommuni¬ 
cation network (the call detail records stand for the edges in 
a call graph between a caller and a called party). All these 
real-world graphs are directed multigraphs, meaning that two 
entities may be linked by multi-edges. We aim to summarize 


and discover insightful patterns in such graphs, using a method 
with the desired following properties: 

1) Robustness, to avoid detecting spurious patterns in 
case of noisy data. 

2) Non asymptotic validity, to be able to detect reliable 
patterns with as few data as possible. 

3) Asymptotic convergence to the true underlying dis¬ 
tribution when it exists. 

4) Parameter less [] , with no user parameters to set. 

5) Scalable, to process large graphs. 

In this paper, we present a novel way of analyzing and 
summarizing the structure of large graphs, based on piecewise 
constant edge density estimation. We apply data grid models 
m to graph data, where each edge is considered as a statistical 
unit with two variables, the source and target vertices. The 
objective is to find a correlation model between the two 
variables by the means of a data grid model, which in this 
case turns out to be a coclustering of both the source and 
target vertices of the graph. The cells resulting from the cross- 
product of the two clusterings summarize the edge density 
in the graph. The best correlation model is selected using 
the MODL approach 0, and optimized by the means of 
combinatorial heuristics with super-linear time complexity. The 
MODL approach is an application of the Minimum Description 
Length (MDL) principle [8], specialized in the following way: 
use of hierarchical prior uniform at each stage of the hierarchy 
of the model parameters, use of discrete model parameters 
that are data dependent and use of combinatorial optimization 
algorithm to find the MAP (Maximum a Posteriori) model. 

Compared with our previous work that introduced data 
grid models 0, we suggest a new interpretation of these 
models as universal density estimators and apply them to 
directed multigraphs, with potential loops and multi-edges, 
by considering these graphs as distributions of edges with 
unknown density. We also show that in our model selection 
approach, the finite data sample is modeled directly, with a 
data-dependent prior distribution of the model parameters: this 
provides a non-asymptotic validity of the method. Further¬ 
more, we demonstrate new fundamental results that prove the 
consistency of the approach, with an asymptotic convergence 
to the true underlying distribution when it exists. Finally, 
we relate our method to existing approaches and present 
extensive comparative experiments. Throughout the paper, all 

Parameter-less means that there are no user parameter or hyper-parameter 
to set and that the model parameters are automatically inferred during the 
training process. 



the experiments are performed on a Windows PC with Intel 
Xeon W3530 2.8 Ghz, using the Khiops software for the 
graph coclusterings based on our approach. 


The rest of the paper is organized as follows. Section [II] 
explores related work. Section III reformulates the MODL 
method for data grid models in terms of finite data sample 
modeling and applies it to graphs. Section [IV] introduces the 
problem of edge density estimation, demonstrates the asymp¬ 
totically consistency of the MODL approach, and provides 
experimental results regarding the convergence rate. Section [V] 
points out the differences between our approach and three 
alternative methods, using artificial datasets in controlled com¬ 
parative experiments. Section [VI] shows that our method can 
be used both for exploratory analysis and as a preparation step 
for supervised learning, using three real world datasets. Finally, 
Section IY3 gives a summary and suggests future work. 


II. Related Work 

Many approaches aiming at summarizing and finding struc¬ 
tures in graphs have been proposed in the literature. In this 
section, we discuss several of these approaches and relate them 
to our approach. 


A. Graph clustering 


Many approaches have been studied for the problem of 
graph clustering, including hierarchical clustering, divisive 
clustering, spectral methods, random walk (for_a survey, see 
GO, [TOl). To evaluate the quality of a clustering [11 ] regardless 
of the cluster number, the modularity criterion iTHUl is now 
widely accepted in the literature and has even been treated as 
an objective function in clustering algorithms IT51 . 0. This 
criterion aims to obtain dense clusters where the within-cluster 
edge density is significantly above the expected edge density 
in case of random edges following the same vertex degree 
distribution. Actually, not all graphs follow a cluster tendency 
ED, with a structure consisting of natural clusters. Yet, all 
clustering algorithms output a partition into clusters for any 
input graph. While the clustering setting is relevant in many 
domains, our approach does not rely of such cluster tendency 
assumption and may have a wider range of application. This 
is illustrated experimentally in Section V-A| 


B. Blockmodeling 

More expressive graph models aim at searching a partition 
of both the source and target vertices into clusters, with 
different types of interaction between clusters. The cross- 
product of the two partitions of vertices form a partition of 
the edges into blocks or coclusters. This modeling approach 
is called blockmodeling and has been thoroughly studied 
for decades. In early approaches m, ge non-stochastic 
blocks are considered, with a focus on predefined types of 
block patterns. The blockmodel is searched either indirectly 
using a (dis)similarity measure between pairs of vertices and 
then applying a standard clustering algorithm, or directly by 
optimizing an ad hoc function measuring the fit of real blocks 
to the corresponding predefined types of blocks. The limit of 

2 Khiops is a general purpose data preparation and scoring tool available as a 
shareware at http://www.khiops.com which implements the MODL approach 
described in Section [TIT| 


these approaches is that they do not cope with the stochastic 
nature of many real world datasets. 

C. Stochastic blockmodeling 

Using the framework of the exponential family, stochastic 
blockmodels are introduced by Holland et al. fl7i . with 
blocks still specified a priori. The approach is extended by 
Wasserman and Anderson da to the discovery of block 
structure and exploits a statistical criterion, e.g. likelihood 
function, optimized using the EM algorithm. The method of 
Snijders and Nowicki ED considers blockmodels where the 
edge probabilities depend only on the blocks to which the 
vertices belong. The considered models are limited to two 
blocks, and searched via maximum likelihood estimation using 
the EM algorithm for small graphs and via Bayesian Gibbs 
sampling for larger graphs. The blockmodels are broadened to 
an arbitrary number of blocks f20L and optimized via Monte 
Carlo Markov Chain (MCMC) Bayesian inference. Karrer and 
Newman ED propose to include the degree distribution of 
the vertices as a correction to the blockmodels in order to 
better fit real world graphs. For a survey on recent work on 
stochastical blockmodeling via maximum likelihood methods, 
see E2 In the connected approach named “mixed member¬ 
ship blockmodels” [231, the mixture models are approximated 
owing to variational methods, which offer better scalability 
at the expense of approximating the objective function. Non- 
parametric extensions of the stochastic blockmodeling ap¬ 
proach have also been considered. In l24l . a Dirichlet process 
is exploited as a prior for partitions of any size, to cluster 
both the source and target vertices of a directed simple graph, 
where the edges within each block are generated according to 
a Bemouilli distribution. Still, hyper-parameters are required 
in these approaches, such as the concentration parameter in 
the Dirichlet process, which influences the expected number 
of clusters in the non-asymptotic regime. Our approach is 
closely related to non-parametric ^ stochasticl blockmodeling 
approaches. It differs from existing approaches on the main 
following points: it is not restricted to simple graphs, the model 
selection method exploits a MAP (maximum a posteriori) 
approach with an exact analytical criterion, not a Bayesian 
approach aiming at approximating the posterior distribution of 
the models, it is parameter-less, with no user parameter to set, 
and it exploits scalable optimization heuristics. A comparative 
experiment with a sta tistical block modeling method is pre¬ 
sented in Section IV-BI 

D. Coclustering 

A directed multigraph is fully described by its adjacency 
matrix, where each entry of the matrix contains the number of 
edges between a source vertex (in a row) and a target vertex (in 
a target). This is equivalent to the contingency table between 
two categorical variables in a dataset. Such contingency tables 
can be summarized using coclustering methods. In the applied 
mathematics field, the seminal work of [26l treats the problem 
of coclustering of a numerical matrix, by looking at a partition 
of the rows and columns of the matrix. In the data mining field, 
in case of binary variables, this technique has been applied to 
the simultaneous partitioning of the instances into clusters and 

3 Here, we use the term non-parametric like in ['25 ) for models where the 
number of parameters is not fixed and may grow with the sample size. 










of the variables into groups of variables ( 23 , with methods like 
(28) closely related to the stochastic blockmodeling approach. 
Coclustering has also been applied to the domain of gene 
expression data (29) . (30) by minimizing the sum squared 
residue to approximate a numerical matrix. The method of 
ED optimizes a minimum loss of information to summarize a 
binary distance, with an application to text mining. While most 
approaches require the number of row and column clusters 
as user parameters, the method of (32) is parameter-less, 
by directly optimizing the Goodman-Kruskal’s r measure of 
association between two categorical variables. Like the method 
of E2L our approach focuses on contingency tables containing 
frequency data and is parameter-less. However, our method, 
being fully regularized, is both resilient to over-fitting and able 
to approximate the true joint distribution when it exists. 

E. Minimum description length based methods 

In the method of E2L the MDL principle (8), EH is 
employed for the inference of the whole set of blockmodel 
parameters, including the number of blocks, in case of directed 
simple graphs. Using a two-part scheme for encoding for the 
model parameters and the data given the model, a parameter¬ 
less regularized criterion is obtained, inheriting from the MDL 
method’s resistance to over-fitting. The criterion is optimized 
using a greedy top-down heuristic, by adding one cluster at a 
time from a single-cluster initial model, and optimizing each 
coclustering model by moving values across clusters. In (351 . 
the MDL method of [321 is extended to the case of spatial data 
mining. The method of (36) is dedicated to undirected simple 
graph, and the objective function is optimized using simulated 
annealing. Following these MDL methods, several encoding 
schemes are explored in [ 37] in the case of undirected simple 
graph to study their resistance to over-fitting. A fast multi¬ 
level algorithm is exploited to generate candidate partitions 
of the vertices of varying sizes, with a focus on the single 
versus multiple cluster question. The study shows that earlier 
approaches tend to over-fit the data with more than one cluster 
in case of random graphs, especially in case of skewed degree 
distribution of the vertices. The early method of E3, which 
applies to directed simple graphs, is the closest to our method. 
The main differences are that our method (1) can be applied to 
directed multigraph, (2) relies on an exact analytical criterion 
without any asymptotic approximation (such as using empirical 
entropies to encode the data, like in previous MDL-based 
methods), (3) is valid both in the non-asymptotic and asymp¬ 
totic case, and (4) exploits a bottom-up greedy optimization 
heuristic. The impact of these differences both on artificial 
and real-world datasets is assessed in Sections IV-CI and [VD 

F. Alternative binary matrix summarization approaches 

Other approaches have been proposed to extract patterns 
from binary datasets. For example, a tile (38) is a region of 
a database defined by a subset of rows and columns with a 
high density of 1, and a collection of tiles constitutes a tiling. 
A tile is then closely related to one single dense cocluster, 
and a tiling to a coclustering, although the tiling is not a 
partition of the database. In am m, the maximum entropy 
principle HD is applied with row and columns marginals as 
prior information, leading to an interestingness_measure of 
tiles and a method for extracting a tiling. In (42l . the same 


framework is applied for extracting multi-relational patterns: 
by representing multi-relational data as a K-partite graph, ex¬ 
tracting complete connected subgraphs reduces to the problem 
of extracting tiles. Coclustering has also been extended by 
considering a hierarchy of coclustering, where each cocluster 
is itself partitioned into a set of sub-coclusterings. In [43), 
the MDL method of (33l is extended to find such patterns 
automatically, whereas in 144) . the problem is treated using 
the Mondrian process, a multidimensional generalization of the 
Dirichlet process. Tiling has also been extended to hierarchies 
of tiling in (45). In (46) . a binary matrix is decomposed 
as a product of Boolean factor matrices; extending standard 
matrix factorization methods, the proposed approach allows 
a better interpretability of the extracted patterns. Compared 
with these alternative pattern extraction approaches in binary 
matrices, our method focuses on the problem of coclustering 
of a contingency matrix (or adjacency matrix of a directed 
multigraph), which is a matrix of counts. 

III. MODL Approach for Graphs 

Data grid models (6) have been introduced for the data 
preparation phase of the data mining process (47) . which is 
a key phase, both time consuming and critical for the quality 
of the results. They allow one to automatically, rapidly and 
reliably evaluate the class conditional probability of any subset 
of variables in supervised learning and the joint probability 
in unsupervised learning. Data grid models are based on a 
partitioning of the values of each variable into intervals in the 
numerical case and into groups of values in the categorical 
case. The cross-product of the univariate partitions forms a 
multivariate partition of the representation space into a set of 
cells. This multivariate partition, called the data grid, can be 
interpreted as a piecewise constant non-parametric estimator 
of the conditional or joint probability. The best data grid is 
searched using a MAP approach and efficient combinatorial 
heuristics. The method is non-parametric in the statistical 
sense, since it does not rely on the assumption that the data 
are drawn from a given probability distribution. It is also 
parameter-less, since all the model parameters, which number 
grows with the sample size, are automatically inferred without 
any user parameter. 

A. Basic Notions of Graph Theory 

A graph G = ( V,E ) consists of a set V of vertices and a 
set E of pairs of vertices called edges. A graph is undirected 
if the edges are unordered pairs of vertices, and is directed if 
the edges are ordered. A loop is an edge from one vertex to 
itself. A graph is simple in case of at most one edge per pair 
of vertices, and is a multigraph otherwise. 

Two vertices of an undirected graph are called adjacent if 
there is an edge connecting them. An edge is incident to its 
two vertices, called extremities. The degree of a vertex is the 
number of edges incident to it. In case of directed graph, the 
extremities of an edges are called the source and target vertices 
of the edge, the in-degree of a vertex v is the number of edges 
with target v, and the out-degree of v is the number of edges 
with source v. 

Graphs can be represented by their adjacency matrix, 
where each cell of the matrix contains the number of edges 




per pair of vertices. The adjacency matrix of simple graphs 
contain only binary values, and that of undirected graphs is 
symmetrical. Figure [I] displays an example of directed simple 
graph. Figure [2] displays a directed multigraph with self-loops, 
as well as it adjacency matrix and the in and out-degrees of 
each vertex. 

B. MODL Criterion for Graphs 

We reformulate the data grid approach in the context of 
edge density estimation in directed multigraphs. As shown in 
Figure [T] a directed graph can be represented in a tabular 
format with two variables, source vertex and target vertex, 
and one line per edge described by its two vertices. We can 
then apply the data grid models in the unsupervised setting to 
estimate the joint density between these two variables, which 
is the density of edges in the graph. 

In this section, we formulate the approach as a modeling 
of a finite data sample, where the model parameters aim to 
summarize the edge counts in the sample. 


Source 

Target 

A 

D 

A 

F 

B 

A 

B 

C 

B 

D 

D 

G 

F 

G 

G 

E 



Fig. 1. Directed simple graph and its tabular representation. 

Let S and T be a source and target set of vertices, with 
ns = \S\ source vertices and nr = |Tj target vertices, and let 
G be a directed multigraph with m edges from S' to T. Given 
S, T and m, our objective is to provide a joint description of 
the source and target vertices of the edges in the graph. One 
simple way to describe the edges exploits the tabular format 
shown in Figure [T] with the count of edges per pair (Source, 
Target) of vertices. We can also summarize the location of 
edges at a coarser grain by introducing clusters of source 
vertices and clusters of target vertices, and considering the 
number of edges per pair of source and target cluster (coclus¬ 
ter). Such a coclustering model provides a summary of the 
graph. The coarsest summary is based on one single cluster of 
vertices with just the total number of edges, whereas the finest 
summary exploits one cluster per vertex, with the number of 
edges per pair of vertices. Coarse grained summaries tend to be 
reliable, whereas fine grained summaries are more informative. 
The issue is to find a trade-off between the informativeness of 
the summary and its reliability, on the basis of the granularity 
of the coclustering. 

For given sets of source and target vertices S, T and a given 
number of edges m (sample size), we exploit a family of graph 
coclustering models, formalized in Definition [T] 

Definition 1: A graph coclustering model is defined by: 

• the numbers ks,kr of source and target clusters of 
vertices, 

• the partition of the ns source vertices into the ks 
source clusters, resulting in nf vertices per cluster, 

1 <i<k s , 


• the partition of the nr target vertices into the kr target 
clusters, resulting in nJ vertices per cluster, 1 < j < 
kr, 

• the distribution of the m edges of the graph G 

on the kE = coclusters with edge counts 

«}k M t per cocluster, 

• for each source cluster of vertices i, 1 < i < ks, 
the distribution of the mf edges originating in source 
cluster i on the nf vertices of the cluster, i.e. the out- 
degrees {m^}i<^< ns per source vertex, 

• for each target cluster of vertices j, 1 < j < ks, 
the distribution of the m T j edges terminating in target 
cluster j on the nj vertices of the cluster, i.e. the 
in-degrees {m.j}i<j<n T P er target vertex. 


TABLE I. Notation 


S, T 

ns = \S\ 
n T = \T\ 

G 

m 

source and target vertex sets 

number of source vertices 

number of target vertices 

directed multigraph with edges from S to T 

number of edges in G 

M 

graph coclustering model for given S,T,m 

ks, kT 

number of clusters of source and target vertices 

k>E = kskT 

number of coclusters of edges 

< T 

number of edges for cocluster (i,j) 

mi. 

number of edges for source vertex i (out-degrees) 

m.j 

number of edges for target vertex j (in-degrees) 

nf 

number of vertices in source cluster i 

nj 

number of vertices in target cluster j 

mf 

number of edges originating in source cluster i 

T 

number of edges terminating in target cluster j 

mij 

number of edges for pair (i,j) of vertices 


This notation, summarized in Table |T| is illustrated in 
Figure [2| where a directed multigraph is displayed with its 
adjacency matrix. A clustered version of this graph is presented 
in Figure [3] which results in a coclustering of its adjacency 
matrix. 

ABCDEFG 
A |0|1|0|0|0|0|0| 1 
B 0 0 1 0 0 0 1 2 

C 0 0 1 0 0 0 1 2 

D 0 1 0 0 1 0 0 2 

E D 0 1 0 0 0 1 2 

F00002002 
G 0 0 1 0 0 0 1 2 

£ 0240304 13 

Fig. 2. Directed multigraph and its adjacency matrix. The numbers rriij in 
the adjacency matrix are the numbers of edges for each pair of vertices (for 
example, two edges from F to E). The sums mi. on the right column are 
the out-degrees of the vertices, and the sums m.j on the bottom line are the 
in-degrees of the vertices. The total number of edges is on the bottom right 
corner of the adjacency matrix. 

We assume that the numbers of edges m and of source and 
target vertices ns and are known in advance and we aim 
to model the m edges of G between these two sets of vertices. 
This setting is general enough to account for directed graphs, 
bipartite graphs and undirected graph, where each edge comes 
twice with the two directions. 

The family of models introduced in Definition [I] is com¬ 
pletely defined by the numbers ks and kr of clusters, the 
partitions of vertices into clusters, the edge counts in each 
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Fig. 3. Directed multigraph with two source and three target clusters. The 
adjacency matrix of the graph (reorganized by clusters) is presented on the 
top-left, and that of the clustered graph at the bottom. The numbers m? T in 
the clustered adjacency matrix are the numbers of edges for each cocluster 
(for example, 5 edges from SI to T2). 


coclusters {mf J T }i<i<k s ,i<j<k T ^ an d the out- and in-degree 
of each vertex 

The numbers of vertices per cluster nf and nj are derived 
from the specification of the partitions of vertices into clusters: 
they do not belong to the model parameters. Similarly, the 
numbers of edges originating or terminating in each cluster can 
be deduced by adding the frequencies of coclusters, according 
to mf = m i Y an d mT j Ylih m ij T - 

In order to select the best model, we apply a MAP 
approach. We suggest the prior distribution of the model 
parameters described in Definition^ by applying the following 
modeling choices: use of discrete rather than real-valued distri¬ 
butions, of the “natural” hierarchy of the model parameters and 
choice of uniform distributions at each level of the hierarchy, 
to be as uninformative as possible. 


{mi,}, {rri'j} in the model are exactly the same as the related 
empirical edges counts {M^ T }, {Mi}, {Mj} in the graph 
sample. 

A model which is not consistent with the data cannot 
generate the data and obtains a null posterior probability. We 
now focus on the posterior probability of consistent models to 
obtain the evaluation criterion given in Theorem □ 1 a. 

Theorem 1: The negative log of the posterior probability of 
a graph coclustering model M consistent with a graph sample 
G, distributed according to a uniform hierarchical prior, is 
given by 


c(M) = log ns + log nr 

+ log B(n s , k s ) + log B(n T , k T ) 
— I s 
kE ~ 1 


+ log 


ks 


E lo s 


mf + nf — 1 
nf-l 


i=1 

M / T i T i 

, (m ■ + n 1 - - 1 

Vi 

j = l x 3 

k s k T 

log to! - log mfj r l 

i= 1 3=1 

ks ns 

^ log mf ! — log mi, ! 

i=1 i=l 

k,T nT 

E log mji - log 

3 = 1 3 =1 


m.jl 


(la) 

(lb) 

(lc) 

(l d) 

(le) 

(l f) 

dg) 

(lh) 


Definition 2: The prior for the parameters of a graph 
coclustering model are chosen hierarchically and uniformly at 
each level: 

• the numbers of clusters ks and kr are independent 
from each other, and uniformly distributed between 1 
and ns for the source vertices, between 1 and for 
the target vertices, 


B(n,k) is the number of divisions of n elements into 
k subsets (with potentially empty subsets). When n = k, 
B(n , k) is the Bell number. In the general case, B(n, k ) can 
be written as B(n,k ) = Yli=i S(n,i), where S(n,i) is the 
Stirling number of the second kind |48|, which stands for the 
number of ways of partitioning a set of n elements into i 
nonempty subsets. 


• for a given number ks of source clusters, every parti¬ 
tion of the ns vertices into ks clusters is equiprobable, 

• for a given number kr of target clusters, every parti¬ 
tion of the nr vertices into kr clusters is equiprobable, 

• for a model of size (ks, kr), every distribution of the 
m edges on the kE = kskr coclusters is equiprobable, 

• for a given cluster of source (resp. target) vertices, 
every distribution of the edges originating (resp. ter¬ 
minating) in the cluster on the vertices of the cluster 
is equiprobable. 


Mainly, the evaluation criterion of Theorem [I] relies on 
counting the number of possibilities for the model parameters 
and for the data given the model. In Equation [I] line (la) 
relates to the prior distribution of the cluster numbers ks and 
kr and line (lb) £] to the specification of the partition of the 
source (resp. target) vertices into clusters. These terms are the 
same as in the case of the MODL supervised univariate value 
grouping method |[49l . Line (Tcj ) [^] represents the specification 
of the parameters of the distribution of the m edges on the 
kE coclusters, followed in line © (resp. line ( fle| )) by the 
specification of the distribution of the edges originating (resp. 
terminating) in each cluster on the vertices of the cluster. Line 


Linally, we now introduce the notion of consistency of a 
graph coclustering model with a graph sample in Definition [3] 

Definition 3: Lor given sets of vertices S, T and number 
of edge m, a graph coclustering model M is consistent with 
a graph sample G if and only if the edge counts {mf^}. 


4 For the choice of an integer parameter k uniformly distributed between 1 
and n, we get p(k) = —, leading to — log k = log n. 

c n 

3 For a partition of n elements into k subsets chosen unifomly among 
B(n , k) possibilities, we get log B(n, k ) terms in the criterion. 

6 The number of positive integer parameters (n\,,nf) with Ef=i n i = 

sCT-7 1 )- 































®Q stands for the likelihood of the distribution of the edges 
on the coclusters, by the means of a multinomial coefficient. 
Finally, line fig] ) (resp. line ( [Th] )) corresponds to the likelihood 
of the distribution of the edges originating (resp. terminating) 
in each cluster on the vertices of the cluster. 


As negative log of probabilities are code lengths, our model 
selection technique is similar to a practical minimum descrip¬ 
tion length principle GO, El with two-part MDL code. Our 
method is valid non-asymptotically, since it directly encodes 
the edge counts of the data sample. The inferred MAP model is 
necessarily consistent with the data sample: it provides a valid 
summary of the edge counts in the graph sample, but without 
any asymptotic guarantee w.r.t. the underlying edge probability 
distribution. In Section IV we study the asymptotic behavior 
of the approach as the number of edges in the data sample 
goes to infinity, and demonstrate that it can be interpreted as 
an edge density estimator with asymptotic convergence to the 
true edge distribution when it exists. 


Algorithm 1 Greedy Bottom Up Merge heuristic (GBUM) 
Require: M {Initial solution} 

Ensure: < c(M) {Final solution with improved 

cost} 

1: M* M 

2: while improved solution do 

3: M' <- M* 

4: for all merge between two source or target clusters do 

5: {Consider merge for model M*} 

6: M + <— M* + merge 

7: if c(M+) < c(M') then 

8: M' <r- M+ 

9: end if 

10: end for 

ll: if c(M') < c(M*) then 

12: M* <— M' {Improved solution} 

13: end if 

14: end while 


C. Optimization Algorithm 

Graph coclustering models are no other than data grid 
models 0 applied to the case of joint density estimation 
of the source and target vertices of the edges. The space of 
data grid models is so large that straightforward algorithms 
almost surely fail to obtain good solutions within a practicable 
computational time. Sophisticated heuristics are described in 
m to optimize the criterion c(M). They finely exploit the 
sparseness of the adjacency matrix of the graph and the 
additivity of the criterion, and allow a deep search in the model 
space with 0(m) memory complexity and 0(m^/mlogm) 
time complexity. 

In this section, we give an overview of the optimization 
algorithms which are fully detailed in (51, and rephrase them 
using the graph terminology. The optimization of a data grid 
is a combinatorial problem. The number of possible partitions 
of n vertices is equal to the Bell number B(n) = \ jt- 

Even with very simple models having only two clusters of 
source and target vertices, the number of models involves 
2 ns+nT coclusterings of the vertices. An exhaustive search 
through the whole space of models is unrealistic. We describe 
in Algorithm [I] a greedy bottom up merge heuristic (GBUM) 
which optimizes the model criterion c(M). The method starts 
with a fine grained model, with few vertices per source or 
target cluster, up to the maximum model My [ ax with one 
vertex per source or target cluster. It considers all the merges 
between clusters (independently for the source and target sets 
of vertices), and performs the best merge if the criterion 
decreases after the merge. The process is reiterated until no 
further merge decreases the criterion. 

Each evaluation of the criterion for a model requires 0(n 2 ) 
time, since the initial model contains up to nsriT coclusters 
(see Equation 0 ) in the case of the maximal model Ad] yiax* 
Each step of the algorithm relies on 0(n 2 ) evaluations of 
merges of clusters of vertices, and there are at most 0(n) steps, 
since the model becomes equal to the null model M$ (one sin¬ 
gle cluster) once all the possible merges have been performed. 
Overall, the time complexity of the algorithm is 0(n b ) using 

7 The number of ordered partitions of n elements into k subsets of sizes 
(ni,..., rife) is given by the multionomial coefficient. 


a straightforward implementation of the algorithm. However, 
the method can be optimized in 0(my/m log m) time. The 
optimized algorithm mainly exploits the sparseness of the data, 
the additivity of the criterion and starts from non-maximal 
models with pre and post-optimization heuristics. 

• Large graph are often sparse, with far less edges than 
in complete graphs. Although a model may contain 
0(n 2 ) coclusters, at most m coclusters are non empty. 
Since the contribution of empty coclusters is null in 
the criterion [T] each evaluation of a data grid can be 
performed in 0(m) time owing to specific algorithmic 
data structures (mainly, sparse representation with fast 
access to edges via hash-indexes). 

• The additivity of the criterion means that it can be 
decomposed on the hierarchy of the components of the 
models: extremity (sources vs target variable), cluster 
of vertices, cocluster. Using this additivity property, all 
the merges between adjacent clusters can be evaluated 
in 0(m) time. Furthermore, when the best merge 
is performed, the only impacted merges that need 
to be reevaluated for the next optimization step are 
the merges that share edges with the best merge. 
Since the graph is potentially sparse, the number of 
reevaluations of models is small on average. 

• Finally, the algorithm starts from initial fine grained 
solutions containing at most 0(y/m) clusters. Specific 
pre-processing and post-processing heuristics are ex¬ 
ploited to locally improve the initial and final solutions 
of Algorithm[I]by moving vertices across clusters. The 
post-optimization algorithms are applied alternatively 
to the source and target vertex variables, for a frozen 
partition of the other variable. This allows one to keep 
a 0(m) memory complexity and to bound the time 
complexity by 0(my/m log m). 

Sophisticated algorithmic data structures are necessary to 
exploit these optimization principles and guarantee a time 
complexity of 0(m^/m log m) for initial solutions exploiting 
at most 0(y/m) clusters of vertices. 

The optimized version of the greedy heuristic is time 








efficient, but it may fall into a local optimum. This problem is 
tackled using the variable neighborhood search (VNS) meta¬ 
heuristic 150], which mainly benefits from multiple runs of 
the algorithms with different random initial solutions. The 
main heuristic described in Algorithm [I] with its guaranteed 
time complexity, is used to find a good solution as quickly 
as possible. The VNS meta-heuristic is exploited to perform 
anytime optimization: the more you optimize, the better the 
solution. To favor quality over speed, the meta-heuristic default 
setting is to perform at least 10 rounds of the main optimization 
heuristic before stopping. 

The optimization algorithms summarized above have been 
extensively evaluated dD, using a large variety of artificial 
datasets, where the true data distribution is known. Overall, the 
method is both resilient to noise and able to detect complex fine 
grained patterns. It is able to approximate any data distribution, 
provided that there are enough instances in the train data 
sample. 


IV. Consistency of the Approach for Edge 
Density Estimation 

In this section, we interpret the models presented in Sec¬ 
tion [III] as edge density estimators, demonstrate that the MODL 
approach converges asymptotically to the true edge density 
distribution when it exists. We also provide experimental 
results regarding the convergence rate of the approach. 


A. Edge Density Estimation 


be treated with symmetrical edge probabilities and a pair of 
directed edges per undirected edge jj . 

Given this random graph generative model, the problem 
is to estimate the edge densities in the graph from a finite 
data sample. Estimating the n 2 edge probability parameters 
Pij from a sample of size m is not an easy task, especially in 
the case of sparse graphs. 


B. MODL Approach for Edge Density Estimation 

In the following, we propose a new interpretation of our 
approach described in Section [III] and show how it reduces to 
a finite sample modeling, which asymptotically converges to 
an estimation of the edge density parameters. 

Given a graph coclustering model M as defined in Sec¬ 
tion [m| let us introduce the following notation for the proba¬ 
bility of the edges in a coclustered random graph: 

• {p S Jh< K < k sA<x< k T-- probability distribution of the 
edges falling in each cocluster (ft, A) 

• {PK,i.}k s (i)=K : probability distribution of the out- 
degrees of the vertices i of the source cluster ft 

• {PA,.j}/c T (j)=A : probability distribution of the in¬ 
degrees of the vertices j of the target cluster A 


Using this notation, the probability parameters of a coclus¬ 
tered random graph can be empirically estimated from the edge 
count parameters in a model M according to: 
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In our approach, we consider the graphs as generative 
models, where the statistical units are the edges with two 
variables per edge, the source and target vertices of the edge. 
Whereas most blockmodeling approaches deal with simple 
graphs, focusing on their topology with at most one edge per 
pair of vertices, we regard graphs as statistical distributions 
of directed edges, with potential loops and multi-edges. A 
graph edge density model for a set of n vertices is entirely 
defined by a set of probability parameters {pij}i<i< n ,i<j<n> 
where p^ stands for the probability of each independent and 
identically distributed (i.i.d) edge having source vertex i and 
target vertex j. Given these settings, a graph G containing 
m edges is treated as a sample of size m drawn from the 
edge distribution. Therefore, large samples tend be produce 
complete graphs from a pure topological point of view, but 
with varying edge densities taking into account the generative 
model. 

This edge density model applies to much real world graph 
data. In web log analysis, it seems natural to consider a 
bipartite graph, with users as source vertices, web pages 
as target vertices and edges representing web navigation. A 
sample graph corresponds to an extract of web log data, with 
the popular pages much more seen than the others. In a phone 
call network, each edge represents one phone call from a caller 
vertex to a called vertex, so that two vertices can be connected 
by multi-edges. Collecting the phone calls during a given 
time period corresponds to a sample of a directed multigraph, 
where the potential communities correspond to subgraphs with 
high multi-edge density. The case of undirected graphs can 


This is a piece-wise constant modeling of the edge density 
with respect to the coclusters, constrained by the distributions 
of the in and out-degrees of the vertices in each cluster. 
Assuming the independence between the source and target 
vertices of the edges inside each cocluster, we get the following 
estimation of the edge densities: 
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where (ft, A) is the cocluster containing the edge (i, j). 


For the null model M$ with one single cluster (ft = A = 1), 
we have 

ST i m i- 171 3 m -3 

Pn = 1, Pi,i. = -, Pi,.j = —Pij = -(4) 

m m mm 

which means that the joint probability distribution p^ is the 

product of the two independent marginal distributions of the 

in and out-degrees of the vertices. 


For the maximal model M]y[ ax with one cluster per vertex, 
we have 
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(5) 


which means that the joint probability distribution p^ of the 
edges is directly estimated by the model parameters. 


8 Although our approach can deal with undirected graphs using both 
directions, such graphs would benefit from a specialized prior (potentially 
very different) and from simpler optimization algorithms (no need for alternate 
optimization of the partitions, for example). 





C. Asymptotic Convergence of the MODL Approach 

The family of coclustered random graphs is very expres¬ 
sive and can theoretically approximate any edge distribution 
provided that there is sufficient data. The problem is to select 
the best model given the data. 

The MODL approach aims to model directly the finite 
data sample, and exploits a discrete model space of the edge 
counts in the sample graph. Working on a set of parameters 
of finite size allows one to define a “natural” hierarchical 
prior with uniform distribution at each level of the hierarchy, 
as in Definition [2| and reduces to counting in this discrete 
model space. This data-dependent modeling technique leads 
to the criterion of Equation [TJ which can be interpreted as 
the exact posterior probability of the sample graph given the 
model (Bayesian interpretation), or the exact code length of 
the model parameters and edges given the model (two-part 
MDL interpretation (8), lf34l ). Therefore, the criterion does not 
rely on empirical estimation of continuous-valued parameters 
(such as probabilities or entropies), which are valid only 
asymptotically. We now study whether for given sets of source 
and target vertices, this exact finite data sample modeling 
asymptotically converges towards the true edge density when 
it exists, as the edge number goes to infinity. 

Let us first recall some concepts from information theory. 
The Shannon entropy H(X) il5Tj of a discrete random variable 
X with probability distribution function p is defined as: 

H(X) = - ^2 P( x ) l°g p( x )- ( 6 ) 

The mutual information of two random variables is a quantity 
which measures the mutual dependence of the two variables 
ED; it vanishes if and only if they are independent. For two 
discrete variables X and Y, the mutual information is defined 
as: 

I(X;Y) = Y p ( x ’ y ^ log TYT\ ’ ( ? ) 


variables. Since the mutual information of two variables is 
not other than the Kullback-Leibler divergence ED between 
the joint probability distribution of two variables and their 
independent joint distribution, this means that the best selected 
coclustering tends to highlight contrasts between the two 
variables, being as far as possible from their independent joint 
distribution. 

We now present an important result in Theorem [3j which 
shows that the MODL approach asymptotically converges 
towards the estimation of the true edge distribution, which is 
the joint distribution of the source and target vertex variables. 
Although the modeling technique is data-dependent (regarding 
the model space and the prior on the model parameters) 
and aims to model exactly the data sample with a discrete 
distribution of the sample edges on the vertices, not the true 
edge continuous-valued probability distribution, this theorem 
demonstrates the consistency of the approach. 

Theorem 3: The MODL approach for selecting a graph 
coclustering model M asymptotically converges towards the 
true edge distribution, and the criterion for the best model 
M Bes t converges to m times the entropy of the edge variable, 
which is the joint entropy of the source and target vertices 
variables. 

lim =H (Vs,Vt). (9) 

m—>- oo 777, 

Proof: See Appendix. ■ 

As a corollary of Theorem [3j Theorem [4] states that the 
MODL approach allows one to estimate the mutual informa¬ 
tion between the source and target vertices variables. 

Theorem 4: The MODL approach for selecting a graph 
coclustering model M asymptotically converges towards the 
true edge distribution, and the criterion for the null model 
minus the criterion for the best model M Bes t converges to 
m times the mutual entropy of the source and target vertices 
variables. 


where p(x, y ) is the joint probability distribution function of 
X and Y, and p(x) and p(y) are the marginal probability 
distribution functions of X and Y respectively. 

Let us consider edges as statistical instances, with two 
vertex variables Vs and Vr having ns and nr values, and 
two vertex cluster variables Vg 4 and VjK having ks and kr 
values for a given coclustering model M. 

We present in Theorem [2] an asymptotic approximation of 
the evaluation criterion c(M) introduced in Equation [I] 

Theorem 2: The MODL evaluation criterion (Equation [lj 
for a graph coclustring model M is asymptotically equal to 
m times the entropy of the source and target vertex variables 
minus the mutual entropy of the variables grouped. 

c(M) = m ( H(V S ) + H(V t ) - I(V S M -, V?)) + O(logm). 

( 8 ) 

Proof: See Appendix. ■ 

As the criterion has to be minimized, this means that 
the method aims to select a coclustering model which maxi¬ 
mizes the mutual information between the two vertex cluster 


lim 
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c(M 0 ) - c(M Best ) 


m 


= I(V S ;V T ). 


( 10 ) 


Let us recall that the mutual information I (Vs; Vr) is null 
in case of independent source and target vertices, such as for 
Erdos-Renyi random graphs (52). Theorem [4] shows that for 
such graphs, the best selected model will be asymptotically the 
same as the null model (which actually represents the case of 
independence). Since the MODL approach is regularized, with 
prior terms in criterion c(M ) that grows with the granularity 
of the clusters, we expect the approach to select the null model 
in case of independence, even in the non-asymptotic case. This 
expected behavior is confirmed experimentally in Section [V] 

D. Experimental Convergence Rate of the MODL Approach 

We have shown that although the MODL approach aims to 
model the data sample directly, it asymptotically converges 
towards the true edge density. The assumption behind the 
MODL approach is that the non-parametric edge density 
estimation will benefit from fine tuned finite data-dependent 
model space and prior, so as to converge as fast and reliably 
as possible. 






Fig. 4. Convergence of the MODL approach on the random circular graph 
with 100 vertices: number of clusters and difference between the estimated 
and true mutual information I (Vs; Vt) per edge number in the graph sample. 


This convergence rate is hard to analyze theoretically in 
the non-parametric setting, without any assumption regarding 
the true edge density. For example, in the simple case of a 
cluster-based graph, the adjacency matrix is block-diagonal 
and most of the edge probabilities are null. In this case, few 
parameters need to be estimated and the convergence is fast. In 
this section, we chose a more difficult sample graph where the 
distribution of the edge probabilities is rather smooth with no 
cluster-based structure, unbalanced and never null, and present 
an experimental study of the convergence rate of the approach. 


Let us introduce circular random graphs as directed 
multigraphs, where the n vertices lie equidistant on the unit 
circle at positions (pci = cos , yi = sin ). The Eu¬ 
clidian distance between two vertices i and j being = 
yj(xi — Xj ) 2 + (yi — i/j ) 2 , which we extend by continuity to 
d(i,i) = ^ for self-loops, we define the probability of having 
an edge between two vertices in inverse proportion of their 
distance, according to: 


Vi,j, Pij 
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In such a circular random graph, the largest edge probabil¬ 
ities (related to self-loops with d = are n times larger than 
the smallest ones (related to pairs of vertices on a diameter 
of the circle, with d = 2), with a continuous decrease of 
edge probabilities in inverse proportion to the distance of their 
extremities. 

In our experiment, we chose a circular random graph with 
n = 100 vertices and randomly generate edges from sample 
size varying from 100 to 10 6 . For each sample size, we run the 
MODL algorithm and collect both the number of clusters and 
the mutual information estimated according to the data grid 
edge probability estimator (see Equation [3]) a nd to the MODL 
criterion (see Equation [10] from Theorem |4j). For comparison 
purpose, we also report the true mutual information (known 
exactly for this artificial dataset), as well as its empirical 
estimation (using p^ = ^h±) and according to the Laplace 
estimator (p^ = 

The experiment is repeated 100 times, so as to estimate 
both the mean and the standard deviation of the collected 
measures, presented in Figure [4] The empirical estimator 
tends to over-fit the data (the mutual information is largely 
overestimated, especially for small sample sizes), whereas the 
Laplace estimator tends to under-fit the data. In accordance 
with Theorem]?] the MODL criterion of Equation [T0| converges 
towards the true mutual information. The MODL(LH) criterion 
(likelihood terms only, without prior terms as in Equation [TO] ) 


related to the best selected model exhibits a much faster 
convergence rate, while not over-fitting the data. 

Figure [?] shows three phases in the convergence of the 
MODL approach. In the first phase ( stability phase), the 
number of edges is not sufficient to reliably estimate the 
edge probabilities, and the approach evaluates the random 
graph with one single cluster as being the most probable. 
In the second phase (non-parametric estimation phase), the 
method identifies regularities in the graph by building clus¬ 
ters and approximating the true mutual information, with an 
increased accuracy as the sample size grows. In the third 
phase (classical estimation phase), the method has built one 
cluster per vertex and estimates all the n 2 edges probabilities 
simultaneously according to a classical empirical estimation of 
a set of multinomial parameters: the accuracy of the estimation 
“classically” increases with the sample size. In this example, 
the non-parametric estimation phase starts when the number of 
available edges is about 500, about ^ of the number of edge 
probability parameters, and has converged at about 150,000 
edges, fifteen times the number of edge probability parameters. 
It is noteworthy that in most large real world graphs, the 
method will run in the non-parametric estimation phase. 

Overall, this experiment shows that the method is reliable, 
quickly discovers true regularities in the graph as soon as there 
is sufficient data, and is able to approximate complex edge 
density distributions with a fast convergence rate. 

V. Comparative Experiment on Artificial 
Datasets 

In this section, with exploit artificial datasets, where the 
true edge distribution is known, to compare our approach to 
alternative methods. We first relate our method to a graph 
clustering method, then to a statistical blockmodeling method, 
and finally to a parameter-less coclustering method based on 
MDL. 

A. Comparison with Graph Clustering 

This section, intended to be of a tutorial nature, points out 
the difference between graph clustering and our method, which 
exploits a coclustering approach. We first recall the modularity 
criterion which is widely used for graph clustering methods, 
then illustrate the difference between the approaches using 
artificial datasets. 

1) Modularity Criterion for Graph Clustering Methods: 
The goal of community detection is to partition a network into 
clusters of vertices with high edge density, with the vertices 
belonging to different clusters being sparsely connected. To 
evaluate the quality of a partition, the modularity Q lfT2) is a 
widely used criterion in recent community detection methods. 
The modularity measures the density of edges inside clusters 
as compared to the one expected in case of independence of 
the vertices. 

Given a graph G with n vertices and m edges, let rriij be 
an element of the adjacency matrix of the graph, rriij = 1 if 
vertices i and j are connected by an edge, rriij = 0 otherwise. 
The degree of a vertex i is defined by the number of edges 
incident upon it. In the case of undirected graphs, the ouput 
and inp ut deg ree of a vertice are equal. Using the notation of 
Section |III-B| we have 


























vrii, = rri'i 
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An undirected graph with mu edges corresponds to a 
symmetrical directed graph with m = 2 mu edges. Assuming 
that the vertex degrees are respected, the probability of a 
random edge between vertices i and j is rai.ra. j/ra 2 . The 
modularity Q is defined as 

Q = ( m *j ” ( 13 ) 

where is the index of the cluster to which 

vertex i is assigned, the ^-function 5(x,y) is 1 if x = 7/ 
and 0 otherwise and m = is twice the number of 

(undirected) edges. The modularity takes its values between 
-1 and 1 and has positive values when the clusters have more 
internal edges that the expected edge number if connections 
where made at random, with the same vertex degrees. The 
value of this criterion is 0 in the two extreme cases of 
one single cluster and of as many clusters as vertices. The 
modularity criterion has two appealing properties: it is well 
founded for the discovery of clusters with a density higher 
than the expected density when the extremities of the edges 
are independent, and it does not require any parameter, such 
as the number of clusters. 

Modularity has been used to evaluate the quality of par¬ 
titions for a large variety of methods such as hierarchical 
clustering, spectral clustering, random walks (see Section [I]), 
but also as an objective function to optimize. In this section, 
we compare our approach with the state of the art Louvain 
method [3], which is very fast and builds high quality partitions 
(measured by the modularity criterion). 

2) Artificial Graph Family: We introduce a family of 
artificial graphs consisting in four clusters of ten vertices, 
named A, B,C,D. For two-dimensional depiction purpose, we 
consider the case of undirected simple graphs, with at most 
one edge per pair of vertices and no loops, and control the 
proportion of potential edges per cocluster, that is per pair 
of clusters of vertices. This is illustrated in Figure [5j where 
the four clusters of vertices are drawn on circles for better 
readability. For example, choosing a proportion p = 20% for 
the edges of (A, B) means that 20% of the potential edges 
with one extremity in A and the other one in B (among 
100 = 10 * 10 edges) are in the graph. In the rest of the 
section, we illustrate the difference between graph clustering 
and coclustering using two graph patterns: quasi-cliques and 
cocliques. 

3) Quasi-cliques: Figure [5] presents a classical pattern con¬ 
sisting of four dense clusters, with an intra-cluster density of 
80% and an inter-cluster density of 10%. The parameters of the 
edges distribution are shown on the left, then an example of a 
graph generated according to this distribution, and on the right 
the clusters retrieved using our approach and the modularity- 
based approach, with a different color per cluster. The cluster 
based and coclustering methods (modularity method of 0 
and our approach) obtain the same result, with a correct 
identification of the four clusters related to A, B,C,D (with 
Q = 0.409). In the case of a graph that can be decomposed into 
dense clusters, the two approaches exhibit the same behavior. 
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Fig. 5. Artificial graph: quasi-cliques. 

4) Co cliques: Figure 6] presents a pattern consisting of four 
cocliques, which are subgraphs with no inner edge, and an 
edge density across cocliques of 50%. In this example, the 
intra-cluster density is far below the average density. Actually, 
not all graphs have a structure consisting of natural clusters. 
Yet, all clustering algorithms output a partition into clusters for 
any input graph, and the cluster based algorithm builds dubious 
clusters in this case. Our approach correctly retrieves the four 
empty cocliques. In this case, the modularity is negative (- 
0.251), which reflects the fact that the ratio between observed 
and expected edge density is far below 1. 
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Fig. 6. Artificial graph: cocliques. 

B. Comparison with Stochastic Blockmodeling 

In this section, we focus on the blockmodeling approach 
and report comparative experiments on two kinds of artificial 
graphs: blockmodel graphs and random graphs. 

1) Blockmodel Graphs: Our method is a piecewise constant 
non-parametric edge density estimator in directed multigraphs. 
One of the closest existing approach is the statistical block¬ 
modeling one, and we used the StOCNET software for com¬ 
parison purpose. StOCNET Ea is a software system for the 
advanced statistical analysis of social networks, which imple¬ 
ments the stochastic blockmodeling method named BLOCKS 
of Nowicki and Snijders [20], which applies to simple graphs. 
Although StOCNET requires the number of blocks as an input 
parameter, the tool can compute the blockmodeling structure 
for numbers of blocks between 2 and 8. The fit of the block 
structure is evaluated using the clarity of the block structure 
l20l . The authors suggest to keep the model with the best fit 
(smallest clarity) to select the best number of blocks. 

We evaluate the ability of the approach to retrieve the 
correct block structure using a graph consisting of 100 vertices 
with three clusters: 30 vertices in cluster A, 40 in cluster 
B and 30 in cluster C. The distribution of the edges across 
the clusters is summarized in Table |V-B1[ where the source 
and target vertices of each edge are uniformly distributed 
within each cluster. For each sample size 100, 200, ..., 1,000 
edges, we generate one hundred datasets according to this 
edge distribution and run both the StOCNET software and our 
method with their default settings. The computation time of 
StOCNET is on average 210 seconds, while our methods takes 
on average 0.2 seconds. 





















TABLE II. Artificial blockmodel distribution with three 

CLUSTERS. 

ABC 

30% 

40% 

30% 

£ 30% 40% 30% 100.00% 


30% 

0% 

0% 

0% 

10% 

30% 

0% 

30% 
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In Figure [7J we report the mean plus/minus the standard 
deviation of the selected block number, both for the StOCNET 
sofware and our approach. While the clarity criterion of 
StOCNET selects on average the correct number of clusters 
for sufficiently large sample size, the selection approach is 
not resilient to sample variability. Our approach builds one 
single cluster for less than 200 edges and exactly three clusters 
beyond 600 edges, with a transition between 300 and 500 
edges where two clusters (A versus B U C) are sometimes 
selected. 




Overall, our method avoids the problem of parametric ap¬ 
proaches where the number K of clusters is a user parameter: it 
neither suffers from under-fitting (K too small) or over-fitting 
(K too large). 

C. Comparison with MDL Coclustering 

In this section, we compare our approach with the Cross- 
Association method l33l . As pointed out in Section [II] this 
method is close to our approach: it performs a coclustering on 
(binary) matrices, is parameter-less and based on MDL, and is 
scalable, allowing experiments on large datasets. 

1) Block-Diagonal Graphs: Let us introduce block- 
diagonal graphs as directed multigraphs with a partition of 
the source (resp. target) vertices into K clusters (chosen of 
equal size in this experiment), with edges lying in the related 
diagonal blocks. For each randomly chosen source vertex, 
an edge is generated randomly in the target cluster related 
to same block. In a noisy version of this pattern, edges are 
generated randomly among the whole graph with probability 
p, and according to the block-diagonal pattern with probability 
(1 — p). The contingency matrix of some samples of noisy 
block-diagonal graphs is drawn in Figure [8] 



Fig. 7. Blockmodel graph: mean number of clusters per sample size, 
plus/minus the standard deviation 


Fig. 8. Sample of Noisy(100, 10), Noisy(1000, 5) and Noisy(1000, 100) 
block-diagonal graphs 


This first experiment shows that our method behaves as 
a blockmodeling approach w.r.t. the retrieved patterns. Being 
non-parametric, it benefits from a regularized criterion to 
reliably estimate the correct granularity of the blockmodel 
pattern. 

2 ) Random Graphs: The Erdos-Renyi f52l random graph 
datasets used for tests were introduced by Johnson et al. [ 54]. 
These graphs are undirected simples graphs, that we treat as 
symmetric directed graphs with twice the number of edges in 
the adjacency matrix. The 16 available random graphs have 
vertex numbers 124, 250, 500 and 1,000, with average total 
degree 2.5, 5, 10 and 20. 

As StOCNET is limited to graphs with at most 200 vertices, 
we could apply it only on the random graphs with 124 vertices. 
StOCNET takes 380 seconds on average on each 124 vertices 
graphs, while our method takes 0.4 second on average for these 
small graphs and up to 18 seconds for the largest graph with 
1,000 vertices and 20,000 edges. 

On the 124 vertices graphs, for each average vertex degree, 
the clarity criterion of StOCNET significantly decreases with 
the number of blocks, leading to the choice of maximum 
number of blocks considered (8 in the software). Using this 
criterion does not lead to a reliable choice of the block number 
in case of noisy data. Conversely, on all the random graphs up 
to 1,000 vertices and 20,000 edges, our method builds one 
single cluster, which confirms its high resilience to noise. 


In the experiments, we use the same number of source 
and target vertic es and consider the block-diagonal graphs 
described in Table |V-CT The Pure and Noisy graphs are block- 
diagonal graphs with a variety of number of vertices and 
blocks, whereas the Random graphs reduce to Erdos-Renyi 
random directed multigraphs. For all power of 2 sizes 1, 2, 4, 
8, ... up to one million edges, we generate ten random graph 
samples according to the ed ge dis tribution of each artificial 
graph summarized in Table V-Cl We run both the Cross- 
Association software 0 and our method with their default 
settings, and collect the mean cluster number and computation 
time. 

2) Cluster Number: In Figure |9| and [T0| we report the mean 
cluster number per sample size obtained with the MODL and 
Cross-Associa tion m ethods for each artificial graph summa¬ 
rized in Table IV-C1I 

For each graph, we observe three phases in Figure [9] 
for the MODL method, as in the experimental convergence 
study in Section [TV-D stability phase , with one single cluster, 
non-parametric estimation phase , with a growing number of 
clusters, and classical estimation phase , where the true cluster 
number is retrieved. The second phase is very sharp for the 


9 The Cross-Association software can be downloaded at 
http://www.cs.cmu.edu\discretionary{-}{}{}/$\sim$deepay/mywww/ 
software/CrossAssociations-01-27-2005.tgz I am grateful to D. Chakrabarti 
for making his method available and providing guidance in its use. 




































TABLE III. 


Block-diagonal graphs 


Name 

#Vertices 

#Blocks 

Noise rate 

Pure(10, 2) 

10 

2 

0% 

Pure(100, 10) 

100 

10 

0% 

Pure(1000, 5) 

1000 

5 

0% 

Pure(1000, 100) 

1000 

100 

0% 

Pure( 10000, 200) 

10000 

200 

0% 

Noisy(10, 2) 

10 

2 

50% 

Noisy(100, 10) 

100 

10 

50% 

Noisy(1000, 5) 

1000 

5 

50% 

Noisy(1000, 100) 

1000 

100 

50% 

Noisy( 10000, 200) 

10000 

200 

50% 

Random(lO) 

10 

1 

100% 

Random(lOO) 

100 

1 

100% 

Random(lOOO) 

1000 

1 

100% 

Random( 10000) 

10000 

1 

100% 


Cluster nb 



-«-Pure(10,2) 

— Noisy(10,2) 

—*— Pure(100,10) 

—♦—Noisy(100,10) 
Pure( 1000,5) 

—*^Noisy(1000,5) 
-»-Pure(1000,100) 

—Noisy(1000,100) 
-*-Pure(10000,200) 
—Noisy(10000,200) 
-o-Random(lO) 

—O-Random(lOO) 
—A— Random(lOOO) 

—O— Random(lOOOO) 


Fig. 9. Block-diagonal graphs: mean number of clusters per sample size 
using the MODL method 


block-diagonal graphs: below a threshold, which depends on 
the complexity of the pattern (number of vertices and of 
blocks), only one cluster is retrieved, and above about four 
times this threshold, the correct number of clusters is retrieved. 
The recognition threshold mainly increases with the number 
of vertices: simpler patterns require less edges to be identified, 
from 50 edges for the simplest Pure(10, 2) graph to 250,000 
edges for the most complex Noisy( 10000, 200) graph. In 
case of noisy data (with 50% noise), the shape of the curves 
is approximately the same, with a translation towards larger 
sample sizes: the noisy patterns require around four times the 
edge number necessary to retrieve the related pure patterns. 
Finally, the MODL method is highly resilient to noise: it never 
produces more cluster than expected, and always outputs one 
single cluster in the extreme case of random graphs. 


Cluster nb 



-»-Pure(10,2) 

—Noisy(10,2) 

—♦— Pure(100,10) 

—•—Noisy(100,10) 
-*-Pure(1000,5) 
-Noisy(1000,5) 
-*-Pure(1000,100) 

-»-Pure( 10000, 200) 
—•—Noisy(10000,200) 
—o—Random(lO) 

-O—Random(lOO) 
-i^-Random(lOOO) 

—O— Random(lOOOO) 


Fig. 10. Block-diagonal graphs: mean number of clusters per sample size 
using the Cross-Association method 

The results obtained by the Cross-Association method in 
Figure [lO] are quite unclear. Still, this blurred behavior may 
be explained by the following reasons. First, in case of pure 
patterns, the top-down algorithm can get stuck in local minima, 
resulting in an under-fitting behaviour in case of patterns 


with many clusters. Second, in case of random patterns, as 
acknowledged in ED, the method tends to pick up spurious 
patterns and more generally to over-fit the data. Last, the Cross- 
Association is designed for binary matrices related to directed 
simple graphs, not to directed multigraphs. Therefore, as any 
noisy multigraph is asymptotically flattened to a complete 
simple graph, the Cross-Association method produces one 
single cluster for noisy patterns with many edges. 


3) Computation Time: The algorithmic complexity of the 
MODL method (main greedy bottom-up heuristic described in 
Algorithm [l]) is 0(m^/m logm ), where m is the number of 
edges. According to (331, the algorithmic complexity of the 
Cross-Association method is 0(m(k* + P) 2 ), where k * and 
P are the number of source and target clusters retrieved by 
the top-down heuristic. Although this is quadratic w.r.t. the 
cluster number, this may be more efficient that our bottom-up 
approach in case of patterns with small number of clusters. In 
Figure |TT] and [12| we report the mean computation time per 
sample size obtained with the MODL and Cross-Association 
methods. 





-«-Pure(10,2) 

—Noisy(10,2) 

—«— Pure(100,10) 

—•—Noisy(100,10) 
-*-Pure(1000,5) 

—*— Noisy(1000,5) 
-*-Pure(1000,100) 
-Noisy(1000,100) 
-*-Pure(10000,200) 
—»-Noisy(10000,200) 
—o—Random(lO) 
-O-Random(lOO) 
-£-Random(1000) 

—o— Random(lOOOO) 


Fig. 11. Block-diagonal graphs: mean computation time per sample size 
using the MODL method 





-«-Pure(10,2) 

— Noisy(10,2) 

—♦— Pure(100,10) 

—— Noisy(100,10) 
-*-Pure(1000,5) 

—*—Noisy(1000,5) 
-*-Pure(1000,100) 

-Noisy(1000,100) 

-*-Pure(10000,200) 
—»-Noisy(10000,200) 
—o—Random(lO) 
-O-Random(lOO) 
-A-Random(lOOO) 

—o— Random(lOOOO) 


Fig. 12. Block-diagonal graphs: mean computation time per sample size 
using the Cross-Association method 


Compared to the theoretical computation complexity, the 
actual computation time depends on many factors, such as 
the numbers of vertices, the size of the true pattern and the 
level of noise. In the case of the MODL method (Figure [TT]), 
the shape of the computation time curve in the worse case 
is compliant with the theoretical complexity, with up to ten 
thousand seconds for the largest noisy graph. In the case of the 
Cross-Association method (Figure |T5]), the shape of the curves 
grows quadratically as expected, but is flattened in the end 
with no more than one thousand seconds for the largest noisy 
graphs. This behavior comes from the bottom-up heuristic 
which is stuck in local minima and hardly produces more than 
around 20 clusters, which allows the method to be quicker at 
the expense of under-fitting the data. 

Overall, whereas our method and the Cross-Association 















































methods are both scalable, parameter-less and based on a 
similar MDL-based model selection approach, the modeling 
choices are quite different, resulting in contrasting behavior. 
Our method is valid both non-asymptotically and asymptoti¬ 
cally and neither suffers from under-fitting nor over-fitting. It 
never produces spurious clusters in case of random graphs and 
recovers complex patterns with an increased precision as the 
amount of available data increases. 




VI. Experiments on Real World Datasets 

This section compares our method with the Cross- 
Association method |33 on real world datasets coming from 
three different domains: document clustering, webspam detec¬ 
tion and exploratory analysis of a flight trip dataset. 


A. Document Dataset 

We exploit the CLASSIC3 document dataset p| introduced 
in ED to evaluate the Information-Theoretic Coclustering 
ITC method. This collection of documents comes from three 
domains: MEDLINE consists of 1,033 abstracts from medical 
journals, CISI consists of 1,460 abstracts from information 
retrieval papers and CRANFIELD consists of 1,398 abstracts 
from aerodynamic systems. Overall, this dataset can be rep¬ 
resented as a directed multigraph with 3,891 source vertices 
(documents), 5657 target vertices (words) and 287,827 edges 
(184,772 edges in a flattened binary representation of the 
graph). 


We build a coclustering of the dataset using the MODL 
method described in Section III The running time is one hour 
and 50 minutes. We obtained a very fine-grained summary of 
the dataset, with 151 clusters of documents and 293 clusters 
of word. 


Agglomerative hierarchical clustering method: In 

order to explore the coclustering at different granularities, we 
suggest to coarsen the obtained coclustering by the mean of 
an agglomerative hierarchical clustering. We use the following 
similarity between two clusters i and j 

S(h3) = c ( M iuj) ~ c(M), 

p(M\G) (14) 

^ p( Miuj |G) 

where c(M) is the evaluation criterion 0 introduced in 
Section [III] M is the current coclustering model and 
is the coclustering model after the merge of clusters i and j. 
Intuitively, if two clusters are similar, the total code length of 
the data (see criterion c(M)) is not much different between 
the cases where the clusters are coded jointly or separately, so 
that S will be small. Actually, the agglomerative hierarchical 
clustering algorithm is the same as that of Algorithm [I] except 
that the merges are performed until the required number of 
clusters is obtained. 


As the CLASSIC3 dataset comes from three domains, we 
chose to coarsen our fine-grained 151*293 coclustering matrix 
down to a 3*3 coarse matrix. The fine and coarse grained 
matrix are shown in Figure |~j~3| 

10 Dataset available at http://www.dataminingresearch.com/index.php/2010/ 
09/classic3- classic4- datasets/ 


Fig. 13. Coclustering of the CLASSIC3 dataset: initial fine-grained 151*293 
coclustering matrix and coarse 3*3 matrix. 


Applying the Cross-Association CA method, we obtain a 
20*20 summary of the flattened binary dataset in about six 
minutes. 

We evaluate the agreement between the coclustering results 
and the known documents classes for the MODL and CA 
methods: 


• MODL: we exploit the coarse 3*3 matrix to correlate 
the three obtained clusters of document with the 
known documents classes, 

• CA: we collect the majority class in each obtained 
cluster of document and manually group the clusters 
sharing the same majority class, down to three clus¬ 
ters. 


The results are reported in Figure 14 In ED, the ITC 
method is applied with three clusters as a user parameter, 
and obtains a good agreement between the retrieved clusters 
and the true classes, with about 60 agreement errors. The CA 
methods obtain a better agreement than the ITC method, with 
about 20% less errors. The results are better for the MODL 
method with twice less agreement errors than for the CA 
method. 


MODL 
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CISI 

CRAN 

E 

CA 
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3 

12 
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1413 
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1 

1379 

1382 

E 

1033 

1460 

1398 

3891 

E 

1033 

1460 

1398 

3891 


Fig. 14. CLASSIC3: agreement with the true classes for the MODL and CA 
methods. 

Overall, the MODL method produces a finer coclustering 
than the CA method, at the expense of a a higher computation 
time. This finer model better fits the data, with a better 
agreement with the true classes in the dataset. 

B. Web Spam Dataset 

In this section, we evaluate the benefit of our method as a 
preprocessing step for web spam detection. 

1) Web Spam Challenge 2007: Web spam consists in 
manipulating the relevance of resources indexed in a manner 
inconsistent with the purpose of the indexing system of internet 
search providers. The data used in this paper comes from 
the Web Spam Challenge (corpus #1, Track II) []] held in 
conjunction with the 2007 ECML/PKDD Graph Labeling 

11 http://webspam.lip6.fr/ Web Spam Challenge 2007 

























workshop. The goal of the challenge is to evaluate machine 
learning methods to label web hosts to be spam or normal. 
The data consists of 9,072 web hosts with both content and 
link data. Content data correspond to the TF-IDF vectors over 
100 web pages of the host, with almost 5 millions features. The 
hosts are the vertices and the link data represent the edges in 
the directed host graph, with one edge per hyperlink between 
two hosts. Overall, the host graph contains 514,700 edges, all 
of them are simple edges. The degrees of the vertices follow 
the power-law distribution, as shown in Figure [15] The training 
set consists of 907 hosts labelled as normal or spam, with 
approximately 20% spam. The objective of the challenge is to 
label the test set (8,165 hosts). The result is assessed using the 
area under the ROC curve (AUC). 




Fig. 15. Webspam challenge host graph: number of vertices per output degree. 

The results of the challenge participants. are reported 
on the right of Figure [IT] They all used both the content 
and link data in a semi-supervised learning setting, on top 
of classifiers among which SVM, random forests and naive 
Bayes. All the available data is exploited to build a model: 
train and test content and link data, plus the train labels. The 
link data is exploited either by extracting link-based features, 
such as number (or ratio) of links from (or to) to spam or 
normal hosts, or by constraining the classifier to account for 
the labels of the connected hosts. 

2) Evaluation: In a first step, we exploit the link data 
only, that is the directed graph consisting of 9,072 hosts 
with 514,700 links. All the hosts and links are processed 
without any output label, to identify the “natural” clusters 
of source and target hosts. We build a coclustering of the 
source and target hosts using the MODL method described 
in Section m The running time is 3 hours and 14 minutes. 
The coclustering of the host graphs retrieves 167 clusters of 
source hosts and 219 clusters of target hosts. The coclustered 
matrix is displayed in Figure |T6] which shows that the graph of 
hosts is highly structured, with most of the information lying 
in few hundred of coclusters. The asymmetry in the hyperlinks 
of the host graph conforms to the observation of the challenge 
participants, that there is usually no link from a normal host 
to a spam host. 

Applying the alternative Cross-Association method, we ob¬ 
tain a 10*10 coclustering of the dataset in about five minutes. 

In a second step, the available labeled train hosts are used 
to describe the output distribution in each cluster of hosts. The 
label of a test host is then predicted according to the output 
distribution of its cluster. However, the MODL coclustering 
is fine grained, and about 20% of the clusters do not contain 

12 Available at http://airweb.cse.lehigh.edu/2008/web_spam_challenge/ 
introduction.pdf Computation time of the participants is not available. 


Fig. 16. Coclustering of the host graph of the web spam challenge. 


a single train instance: in this case the test host is labeled 
as the majority class (normal). To alleviate this problem, we 
chose to coarsen our coclustering at different grain levels with 
the agglomerative hierarchical clustering method described in 
Section VI-A| We build three new coarsened coclusterings, 
starting from the initial 386 = 167 + 219 clusters down to 
200, 100 and 50 clusters. Given this preprocessing, each host 
is represented by eight variables, source or target cluster at four 
grain levels. On the left of Figure [IT] we report the test AUC 
obtained by each univariate cluster-based classifier for the 
MODL coclustering ( S . clust(k) and T. clust(k) for classifiers 
based on source or target clusters for coclusterings with k 
clusters). We also combined these classifiers using a Naive 
Bayes (NB) and a Selective Naive Bayes (SNB) classifier 
l55ll . Using the same protocol, we report the test AUC results 
obtained using the 10 times 10 coclustering retrieved by the 
Cross-Association method on the center of Figure [IT] Finally, 
we also report the results of the challenge participants on the 
right of Figure [17] The test AUC results are obtained using the 
predefined train/test split of the dataset to allow a comparison 
with the challenge participants. No cross-validation results are 
available for the challenge participants. Given that the size of 
the test set is s = 8165, the expected variance of the results 
is around -4= « 1%. 


Test AUC MODL result CA results Challenge results 



Fig. 17. Webspam 2007, phase II: test AUC results for the MODL and CA 
methods, and of the challenge participants. 

The Cross-Association method under-fits the data, with an 
insufficient number of clusters to correctly identify the spam 
class. For both the MODL and CA methods, the results show 
that the target clusters are much more informative than the 
source clusters, with 5 to 10% better test AUC. The granularity 

































































of the coclustering has a medium impact on the MODL test 
AUC: the best results is obtained with about 200 clusters. The 
combined classifiers manage to exploit all the input variables 
to get superior performance. It is noteworthy that the (MODL) 
SNB classifier obtains a test AUC comparable to that of the 
winner of the challenge, while using the link data only. 

Like in the document clustering experiment, the MODL 
method produces a finer coclustering than the CA method, at 
the expense of a a higher computation time. This finer model 
better fits the data, with competitive performance for the task 
of web spam prediction. 


C. Flight Trip Dataset 

In this section, we apply our method on a large passenger 
flight trip dataset for the purpose of exploratory analysis, when 
no ground truth is available. The dataset [^] comes from the 
US Bureau of Transportation Statistics and contains a 10% 
sample of all airline tickets for each domestic flight trip. We 
collected the origin and destination fields as well as the number 
of passengers per trip for the four terms of year 2010, so as 
obtain 10% of all US domestic passenger flight trips for one 
full year. The resulting data table contains 22,038,685 records, 
for a total of 469 airports and 43,092,916 flight trips. The 
airports are the vertices and the trips are the oriented multiple 
edges from origin to destination airports. The average edge 
multiplicity in this graph is 471, with 91,482 different edges. 

We apply the MODL coclustering method on this dataset, 
with a running time of 16 minutes 40 seconds. We obtain a 
quasi symmetric summary of the multigraph with 233 clusters 
of origin airports, 234 clusters of destination airports, and 
46,325 non empty coclusters. 


TABLE IV. US FLIGHT TRIPS AND THEIR COCLUSTERING SUMMARY. 



PI 

WC 

FL 

DV 

CE 

PI 

1.03% 

0.92% 

0.03% 

0.09% 

0.49% 

WC 

0.89% 

17.60% 

1.79% 

3.64% 

10.54% 

FL 

0.03% 

1.69% 

0.30% 

2.74% 

6.15% 

DV 

0.09% 

3.77% 

3.13% 

0.36% 

6.41% 

CE 

0.28% 

11.06% 

5.60% 

6.31% 

15.07% 

E 

2.32% 

35.04% 

10.85% 

13.14% 

38.66% 


E 

2.56% 

34.46% 

10.91% 

13.76% 

38.32% 

100 . 00 % 


In order to explore the extracted coclustering with a 
broader picture, we chose to coarsen our coclustering with 
the agglomerative hierarchical clustering method described in 
Section [VI-A| For depiction purpose, we keep five clusters: the 
coclu stering matrix summary is almost symmetric, as shown in 
Table |VI-C[ Figure [T8] displays the five clusters of destination 
airports. There is a clear geographic correlation in the clusters. 
A first cluster consists of the Pacific Islands (PI), with Hawai, 
Mariana Islands, Guam and Samoa. A second cluster roughly 
corresponds to the Delaware Valley (DV), with counties from 
Pennsylvania, New Jersey, Delaware and Maryland. A third 
cluster contains the Florida airports (FL), while the two last 
clusters correspond the West Coast (WC) and Center and East 
(CE) of America. 

Interestingly, the Pacific Islands cluster is very dense, 
with 17 times more intra-cluster flight trips than expected 

13 http://www.transtats.bts.gov/DataIndex.asp Airline Origin and Destina¬ 
tion Survey (DB1B) 2010, RITA: Bureau of Transportation Statistics 
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Fig. 18. US flight trips: five clusters of airports. 


in case of independence of the origin and destination of the 
trips (1.03% ^ 17 * (2.56% * 2.32%)). On the contrary, 
the Florida and Delaware Valley clusters are sparse clusters, 
with respectively four times and five times less intra-cluster 
trips than expected in case of independence. Although these 
two sparse clusters are not geographically connected, they are 
linked by twice the number of trips than expected. This kind 
of exploratory analysis can be performed at any granularity up 
to the finest coclustering retrieved by our method. 

This illustrates how the MODL coclustering method can 
be used for the task of exploratory analysis, when no ground 
truth is available. 


VII. Conclusion 

In this paper, we have presented a novel way of discovering 
structures in graphs, by considering graphs as generative 
models whose statistical units are the edges, with unknown 
joint density of the source and target vertices. Our method 
applies the MODL approach based on data grid models 0 
to the case of directed multigraphs. By clustering both the 
source and target vertices of a graph, the method behaves 
as a non-parametric estimator of the unknown edge density. 
The modeling approach exploits the MDL principle in a data- 
dependent way: it aims to model the finite graph sample 
directly. The modeling task is then easier, with finite modeling 
space and model prior which essentially reduce to counting. 
This modeling approach is both non-asymptotic and consistent, 
with an asymptotic convergence towards the true edge density, 
without any assumption regarding this density. 

Experiments on artificial data demonstrate that our ap¬ 
proach is both robust, building one single cluster in case 
of random graphs, and accurate, being able to recover fine 
grained patterns. Our method has been applied to a document 
classification problem and to a Web spam detection problem 
in a graph of hosts. The patterns retrieved by our approach 
are highly informative, with a good agreement with the true 
classes. An application on a large flight trip dataset shows the 
potential interest of our method for exploratory analysis, with 
the extraction of insightful and interpretable patterns. 

Our method has a 0(rn^/rn log m) time complexity, where 
m is the number of edges, providing practical running times 
up to millions of edges. In future work, we plan to work on 
faster and more scalable optimization algorithms in order to 
deal with very large graphs, up to billions of edges. 


















Appendix 


APPENDIX 


A. Proof of theorems 

In this appendix we prove theorems [2| and [3] from Sec¬ 
tion ied 


Theorem [2] The MODL evaluation criterion (Equation^ for 
a graph coclustering model M is asymptotically equal to m 
times the entropy of the source and target vertex variables 
minus the mutual entropy of the variables grouped. 


c(M) = m (. H(V S ) + H(V t ) ~ I(V S M ; V?)) + O(logm). 


z(M) = 


ks f T 

rri log m — mf^ log n 

i =i j=i 

/ ks > 

— m log m — ^ mf log mf 
\ i =1 > 

kT 


ST 


m 


log m — m T j log m J 
3 = 1 

n s 

log m — ^ rrii . log 


2=1 

Ut 


i log m — rri'j log mj 

3 = 1 

O(logm). 


( 16 ) 


Proof: According to Equation [T] we have: 


c(M) = log ns + log ut 

+ log B(n s , k s ) + log 5(n T , fer) 
+ kE — 1 N 


+ log 




kE — 1 


2=1 X 4 


kj 1 

X 

7=1 


^log/ m 5+ n J 1 


nj -1 


fes kj 1 

logmi-y^T, log m § T! 

*=1 7=1 
fes ns 

log raf! — log mi ,! 

2=1 2=1 

kT TiT 

Y log m T j ! — log rri j ! 

7=1 3=1 


(15a) 

(15b) 

(15c) 

(15d) 

(15e) 

(15f) 

(15g) 

(15h) 


We study the asymptotic behavior of the criterion when 
the number of edg es gr ows to infinity, for a fixed set of 
vertices. Lines (15a) and )|15b|> of criterion c(M) are bounded 


by constants w.r.t. m. Since the numbers of clusters (ks, hp) 
and the numbers of vertices per cluster (n£, n;Q_ are bou nded 
by the number of vertices (ns, nr), lines < |15c| ), ( |15d| ) and ( |15e| ) 
of the criterion, corresponding to the encoding of the model 
prior parameters, are bounded by O(logm). 


We now focus on the likelihood terms of the criterion 
(lines ( |15f| ), ( |15g| ) and ( |15h[ ). Using the approximation log n\ = 
n(logn — l) + O(logn) based on Stirling’s formula and 
rearranging the terms with new m log m terms, we get: 


Given that the sum of the edge counts in each partition 
(per cocluster, per cluster in and out-degree and per vertex in 
and out-degree) is always equal to m, we obtain: 


c(M) 



ns n T 

E 'kkli. T rn i. V^ 

-log- my 

m m ' 

2=1 j=l 

+ O(logm). 



m -3 

m 


(17) 


As the marginal distributions mf and m T - can be decom¬ 
posed by summation on the joint distribution raf- T , we have: 


ks k T ST ™ i:j 

c(M) = — m V V —— log — m T 

v 7 m m s m 1 . 


m 


E m i. 
— 

2-1 m 


= 1 7 = 1 

J mm 

ns n T 1181 

~~ t m i m i n m i K J 

1 log — 1 —my —— log —— 
rri m m 

3 = 1 

O(logm). 


Considering that the empirical estimation asymptotically 
converges towards the related probabilities, the claim follows. 


Theorem [3] The MODL approach for selecting a graph co¬ 
clustering model M asymptotically converges towards the true 
edge distribution, and the criterion for the best model Ms e st 
converges to m times the entropy of the edge variable, which 
is the joint entropy of the source and target vertices variables. 

lim C ( Mge ^ = H(V s ,Vt). 

m—>-00 m 


Proof: Using Theorem [2] we have 
c(M ) = -mI(Vi I -,VP)+mH(V s )+mH(V T ) + 0(\ogm). 













We apply the Data Processing Inequality (DPI) ED, which 
states that post-processing cannot increase information. More 
precisely, the DPI applies for three random variables X, Y, Z 
that form a Markov chain X Y Z. It means that 
the conditional distribution of Z depends only on Y and is 
conditionally independent of X. More specifically, for three 
random variables such that p(Z\X,Y) = P(Z\Y ), the DPI 
states that /(X; Y) > /(X; Z) 

We apply the DPI to the variables Vs,Vt,V^ . As the 
vertex cluster variable VjK can be computed according to a 
partition of the vertex variable Vt ( V^ = f(Vr)), we have 
p{V^ | Vs, Vt) = piV ^ 1 1 Vt) and thus obtain: 


I{V s -,Vt)>I{VsTt)- (19) 

We apply again the DPI to the variables Vs, Vg 4 . As 
the vertex cluster variable VjV is a function of Vs, we have 
P(V s M \Vs,Vf)=p(V s M \V s ) and get: 


I{V™\ Vs) > I(V r M ; V s m ). (20) 


By transitivity and since the mutual information is sym¬ 
metrical, we get: 


I(V s -,Vt)>I(V s m -,Vt)- (2D 


It is noteworthy that this result applies to compare any 
pair of coclustering models, one of the models being a sub¬ 
partition of the other: the finer model brings a higher mutual 
information. 

The model selection approach corresponds to a minimiza¬ 
tion of the MODL criterion. Let us now show that the best 
selected model asymptotically tends to be finer and finer, 
until reaching the finest possible model with one cluster per 
vertex, which is the maximal model My [ ax that enables a direct 
estimation of the edge probabilities Pij : 

I(V S \ V T ) = I{V s MMax ; vf Max ) > I(V £ tf ; V?). (22) 


If V M, I(vf M ' M ; vf Max ) = I (Vi 4 : Vf), then using 
Theorem [2] the MODL approach asymptotically converges 
towards the true edge distribution. 


If 3 MpM c ,I(vf Max -,vf Max ) = I{vf f \Vf s ) > 

I(V s Mc ;V t c ), with Mf a fine-grained model having the 
same mutual information as the maximal model and M c a 
coarse-grained model, then let us show that the approach 
asymptotically selects the fine-grained model Mf rather than 
the coarser model M c . 


Let e 


HV S f X T f )-i{v^ c XT c ) 


Using Theorem [2] for the convergence of the criterion for 
model M c , 


3 mi, V m > mi, 
c(M c ) 


(h(V s ) + H(Vt) - I(V S M <-, V,f<)) 


e 

<2- 


Similarly, for model Mf , 


3 m 2 , V m > m 2 , 

- (H(Vs) + H(V T ) - KV«’;V?')) 


e 


<2- 


Thus, 


V m > max (mi, m2), 

> H (y s ) + H(V t ) - I(V S M ‘-, 

m 

AMA < Hi y s ) + H(Vt) - 

rn 



V m > max(mi,m 2 ), 


c(M f ) 

m 

cjMf) 

m 


_ cfMi < I{y M c . V M C) _ I{Vs Mf -,Vp) + c, 
^ c(M c ) 


Since the model selection approach corresponds to a min¬ 
imization of the MODL criterion, this means that the best 
selected model Mgest asymptotically tends to be a fine-grained 
model M / having the same mutual information as the maximal 
model MMax, which allows the estimation of the true edge 
distribution. Using Theorem [2] with the maximum model (limit 
of the best selected model), we have: 

c(M u ax ) = -mI(Vs;VT)+mH(Vs) + mH(V T ) + 0 (\ogm). 


As I(X;Y) 
follows. 


H(X) + H(Y) - H(X,Y), the claim 
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